Performance of Predictive Models: Interpretability and Explainability

Authors: Leona Hasani, Leona Hoxha, Nanmanat Disayakamonpan, Nastaran Mesgari

1 Project Overview

1.1 Introduction

Our project takes into consideration three different datasets from Kaggle, each from a different industry: the Cardiovascular Disease dataset from the health industry, the Weather in Australia dataset from the environmental industry, and the Hotel Reservations dataset from the business industry.

The objective of our project is to assess the performance of various supervised learning algorithms in predicting binary target variables. In the next section, we outline the key questions guiding our project, which we will answer throughout the project and in the results and key findings sections.

Moreover, we aim to examine how the most effective supervised machine learning algorithm learns within a given dataset. To achieve this, we will utilize learning curves, which provide insights into the algorithm’s performance as it processes more training data. Additionally, we will devote significant attention to hyperparameter tuning to optimize model performance. By adjusting these parameters, we seek to identify any potential overfitting issues within the datasets. This analysis will involve visualizations showcasing the training and testing performance metrics across various hyperparameter settings.

The primary goal of this project is to enhance our understanding of supervised predictive models, with particular emphasis on overfitting. Overfitting is a complex concept that can be challenging to grasp, often leading to misconceptions. By delving into this topic, we aim to clarify its nuances and implications within the context of machine learning models. Through thorough examination and visualization of performance metrics, we aim to shed light on the factors contributing to overfitting and strategies for mitigating its effects.

1.2 Questions and Problems

In our project, we delve into a series of questions and challenges aimed at enhancing our model’s performance and interpretability. We prioritize the questions based on their significance and relevance as follows:

1. Can the implementation of more sophisticated modeling methods within our dataset lead to enhanced model performance, and how can we interpret such improvements?

2. If one model performs best on one particular dataset, will it also perform best on another dataset using the same method?

3. What is the impact of standardization and normalization techniques on the performance scores of our models?

4. Do we have any imbalanced dataset? If yes, what approach could we use to balance the data?

5. How can we analyze the trade-off dynamics between including all available features and employing feature selection techniques?

6. What approach can be employed to identify the optimal hyperparameters of specific models?

7. Is there a risk of overfitting within our datasets, and what measures can be taken to assess and mitigate this risk effectively?

After the preprocessing steps, exploratory data analysis, and modelling part, in the results and conclusion sections we will try to answer each of the research questions listed above.

1.3 Core Methodology and Additional Elements

As our project delves into predictive models and their interpretability, we aim to provide concise explanations of each model. Additionally, we emphasize the importance of exploring additional techniques to enhance model performance and assess the risk of overfitting in our datasets. Therefore, we offer an overview of our core methodology and additional techniques employed in this project.

1.3.1 Resampling (Random Undersampling)

In many fields like healthcare, imbalanced datasets are common, where one class is much more prevalent than others. This can lead to biased models favoring the dominant class (Bach et al., 2019). One approach to address this is resampling, which involves adjusting the dataset to achieve a more balanced distribution through undersampling the majority class, oversampling the minority class, or a hybrid of both (Snieder et al., 2020). Undersampling, where the majority class is reduced, is suitable for our project, given the lower prevalence of heart disease compared to healthy cases. We’ll use an 80:20 undersampling ratio to strike a balance between improving the model’s ability to detect heart disease and maintaining a dataset representative of real-world distributions (Yanminsun et al., 2011).
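As a minimal sketch (using NumPy rather than the exact project code), randomly undersampling the majority class down to an 80:20 majority-to-minority ratio could look like this; the function name and toy data are illustrative assumptions:

```python
import numpy as np

def random_undersample(X, y, majority_ratio=0.8, seed=42):
    """Randomly drop majority-class rows until the class balance is
    roughly majority_ratio : (1 - majority_ratio), e.g. 80:20."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    majority = classes[np.argmax(counts)]
    minority_n = counts.min()
    # Target majority size so that n_maj / n_min = 0.8 / 0.2 = 4
    target_n = round(minority_n * majority_ratio / (1 - majority_ratio))
    maj_idx = np.flatnonzero(y == majority)
    keep_maj = rng.choice(maj_idx, size=min(target_n, maj_idx.size), replace=False)
    keep = np.concatenate([keep_maj, np.flatnonzero(y != majority)])
    rng.shuffle(keep)
    return X[keep], y[keep]

# 90:10 imbalanced toy data -> undersampled to 400 vs 100 (80:20)
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 900 + [1] * 100)
X_res, y_res = random_undersample(X, y)
print((y_res == 0).sum(), (y_res == 1).sum())  # 400 100
```

The same effect can be obtained with dedicated libraries (e.g. imbalanced-learn), but the logic is just selective row dropping as shown here.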

1.3.2 Feature Selection: KBest

SelectKBest is a univariate feature selection method: it scores each feature against the target with a statistical test and selects the top k features with the highest scores, indicating they are the most relevant for predicting the target variable (Nair & Bhagat, 2019). Therefore, it helps focus on the task’s most important features, making the dataset more manageable and potentially improving the machine learning model’s performance.
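A short sketch of SelectKBest on synthetic data (the dataset, scoring function, and k value here are illustrative, not the project's actual choices):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in: 10 features, of which 4 are informative
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)

# Score every feature against the target (ANOVA F-test) and keep the best 4
selector = SelectKBest(score_func=f_classif, k=4)
X_new = selector.fit_transform(X, y)

print(X_new.shape)             # (200, 4): only 4 columns remain
print(selector.get_support())  # boolean mask marking the kept features
```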

1.3.3 Model Performance Metrics

Because the target variable in all three datasets is binary, we employ classification models; the following evaluation metrics offer a quantitative assessment of how well the models perform (Programmer, 2023).

1.3.3.1 Accuracy

Accuracy is the most used performance metric for evaluating a binary classification model. It measures the proportion of correct predictions made by the model out of all the predictions. A high accuracy score indicates that the model is making a large proportion of correct predictions, while a low accuracy score indicates that the model is making too many incorrect predictions.

Accuracy is calculated using the following formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP represents the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives (Programmer, 2023).

1.3.3.2 Precision

Precision is a metric that measures the proportion of true positives (TP) among the total that are predicted as positive by the model. In other words, precision measures the accuracy of the positive predictions made by the model. A high precision score indicates that the model is able to accurately identify positives, while a low precision score indicates that the model is making too many false positive (FP) predictions.

Precision is calculated using the following formula: Precision = TP / (TP + FP)

where TP is the number of true positives and FP is the number of false positives (Programmer, 2023).

1.3.3.3 Recall

Recall, also known as sensitivity or true positive rate (TPR), is a performance metric that measures the proportion of positives that are correctly identified by the model out of all the actual positives. In other words, recall measures the model’s ability to correctly identify positives. A high recall score indicates that the model is able to identify a large proportion of positives, while a low recall score indicates that the model is missing many positives.

Recall is calculated using the following formula: Recall = TP / (TP + FN)

where TP is the number of true positive instances and FN is the number of false negative instances (Programmer, 2023).

1.3.3.4 F1-score

F1-score is a performance metric that combines precision and recall to provide a comprehensive evaluation of the performance of a binary classification model. It measures the harmonic mean of precision and recall, giving equal importance to both metrics. A high F1-score indicates that the model is performing well in both precision and recall, while a low F1-score indicates that the model is not performing well in either precision or recall (Programmer, 2023).

F1-score is calculated using the following formula: F1-score = 2 * (precision * recall) / (precision + recall)

where precision is the proportion of true positive cases among all the cases predicted as positive, and recall is the proportion of true positive cases among all the actual positive cases.

1.3.3.5 AUC-ROC curve

The ROC (Receiver Operating Characteristic) curve is a graphical representation of the performance of a binary classifier, indicating the tradeoff between the true positive rate (TPR) and the false positive rate (FPR) at different thresholds. The AUC represents the area under this curve, which ranges from 0 to 1, with a higher AUC indicating better model performance (Programmer, 2023).

The TPR and FPR are defined as follows:

True Positive Rate (TPR, sensitivity) = True Positives / (True Positives + False Negatives)
False Positive Rate (FPR, equal to 1 − specificity) = False Positives / (False Positives + True Negatives)
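The formulas above can be collected into one small helper; the counts below are hypothetical, chosen only to illustrate the arithmetic:

```python
def classification_metrics(tp, tn, fp, fn):
    """All five metrics above, computed straight from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # true positive rate (sensitivity)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)               # false positive rate = 1 - specificity
    return accuracy, precision, recall, f1, fpr

# Hypothetical counts for illustration
acc, prec, rec, f1, fpr = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print(acc, rec, fpr)  # 0.85 0.8 0.1
```

In practice we use the equivalent functions from sklearn.metrics, but the arithmetic is exactly this.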

1.3.4 Models

In each of the datasets, we’ve applied six different classification algorithms. These algorithms are used to predict outcomes that are either true or false. The goal is to determine which model performs best in terms of accuracy, precision, and other performance measures for each specific dataset. This helps us understand which algorithm is most effective for a given dataset and prediction task. Before proceeding with the analysis, it’s essential to grasp the functioning and construction of each model. Understanding each model’s mechanics provides insight into how it makes predictions and its underlying assumptions. This comprehension enables us to interpret the results more effectively and choose the most suitable model for our specific dataset and problem. All the classification models are implemented using the sklearn library.

1.3.4.1 Logistic Regression Classifier

Logistic regression predicts the likelihood of an event based on independent variables, making it valuable for classification tasks. By transforming odds into probabilities, it generates predictions bounded between 0 and 1. Coefficients are optimized through maximum likelihood estimation, allowing for efficient prediction (IBM, 2022).

1.3.4.2 Decision Tree Classifier

A decision tree is a type of algorithm used in machine learning for tasks like sorting data into categories or making predictions. It’s like a flowchart, starting with a main question (the root node) and then branching out based on different answers (branches) to eventually reach final conclusions (leaf nodes). It’s designed to divide data into smaller, more manageable groups by making decisions at each step. The goal is to create simple, easy-to-understand rules that accurately predict outcomes. Decision trees can get complex as they grow, so techniques like pruning (removing unnecessary branches) and using ensembles (groups of trees) help keep them accurate and efficient (IBM, 2023).

1.3.4.3 Random Forest Classifier

A random forest is a machine learning algorithm that combines the outputs of multiple decision trees to make predictions. By using a collection of decision trees and injecting randomness into the process, random forests reduce the risk of overfitting and improve accuracy. Each tree in the forest is built on a subset of the data and a subset of features, resulting in a diverse set of trees that work together to provide more accurate predictions (IBM, 2023b).

1.3.4.4 Gradient Boosting Classifier

Gradient boosting is a powerful machine learning technique that combines weak learners, typically decision trees, into a strong predictive model. It operates by sequentially adding trees to correct the errors of the previous ones, using a gradient descent approach to minimize a chosen loss function. This method, marked by its flexibility and ability to handle various types of data, is enhanced through techniques like tree constraints, shrinkage, random sampling, and penalized learning, which mitigate overfitting and enhance predictive accuracy (Jason Brownlee, 2018).

1.3.4.5 KNeighbors Classifier

The K-Nearest Neighbors (KNN) classifier is a type of supervised learning algorithm used for classification tasks. It makes predictions based on the similarity of input data points to the known data points in the training dataset. By creating neighborhoods in the dataset, KNN assigns new data samples to the neighborhood where they best fit. KNN is particularly effective when dealing with numerical data and a small number of features, and it excels in scenarios with less scattered data and few outliers (Alves, 2021).

1.3.4.6 AdaBoost Classifier

AdaBoost, short for Adaptive Boosting, is a powerful ensemble learning algorithm that combines multiple weak classifiers to create a strong predictive model. Its main idea involves iteratively training weak classifiers on different subsets of the training data, assigning higher weights to misclassified samples in each iteration. By focusing on challenging examples, AdaBoost enables subsequent classifiers to improve their performance. The algorithm starts by assigning equal weights to all training examples, then iterates through training weak classifiers, adjusting sample weights and combining classifier predictions based on their performance. This process continues for a specified number of iterations, resulting in a final prediction based on the weighted votes of all weak classifiers (Wizards, 2023).

1.3.5 Learning Curve

A learning curve is applied to illustrate how well a model performs based on the amount of training data. It helps identify learning issues like underfitting or overfitting and assesses dataset representativeness. By comparing training and validation scores across different training set sizes, learning curves reveal how much the model improves with more data and whether its limitations are due to bias or variance errors (Giola et al, 2021).
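A minimal sketch of how such a learning curve can be computed with scikit-learn's learning_curve (synthetic data; the estimator and training sizes are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=400, random_state=0)

# Train/validation accuracy at five increasing training-set sizes
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{n:4d} samples: train={tr:.3f}  validation={va:.3f}")
```

A persistent gap between the training and validation curves points to variance (overfitting); two low, converged curves point to bias (underfitting).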

1.3.6 Overfitting

Overfitting is when a machine learning model is too focused on the training data it’s seen before, so it struggles to make accurate predictions for new data. It’s like a student who memorizes answers but can’t solve new problems (Muralidhar, 2023).

Reasons behind overfitting:

1. Using a complex model for a simple problem, which picks up the noise from the data. Example: fitting a neural network to the Iris dataset.

2. Small datasets, as the training set may not be a representative sample of the population (What Is Overfitting? - Overfitting in Machine Learning Explained - AWS, n.d.).

For example, a model trained to find dogs in outdoor photos might miss dogs indoors because it learned to look for grass.

To spot overfitting, we test the model with more diverse data. One method is called K-fold cross-validation, where we split the training data into subsets and test the model’s performance on each.
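As a sketch, K-fold cross-validation with scikit-learn (synthetic data and a hypothetical model choice):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# 5-fold CV: the data is split into 5 subsets; the model is trained on 4
# and tested on the held-out one, rotating through all folds
model = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())  # a large std across folds hints at instability
```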

To prevent overfitting, we can use strategies like early stopping, where we pause training before the model learns too much noise. Pruning focuses on important features and ignores irrelevant ones (Muralidhar, 2023).

1.4 Changing the hyperparameters of the models

In this project we will manually change some of the hyperparameters of the best-performing model within each specific dataset. The two hyperparameters are the number of estimators and the maximum depth, for both the Random Forest Classifier and the Gradient Boosting Classifier. Before delving into the project, we should first understand what these hyperparameters are.

1.4.1 Random Forest Classifier Hyperparameters

  • Number of estimators - According to (Scikit-learn, 2018) it is “the number of trees in the forest. The default number of estimators is 100”.

  • Maximum depth - According to (Scikit-learn, 2018) it is “the maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. The default maximum depth is None”.

1.4.2 Gradient Boosting Classifier

  • Number of estimators - According to (3.2.4.3.5. Sklearn.ensemble.GradientBoostingClassifier — Scikit-Learn 0.20.3 Documentation, 2009) it is “The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance. Values must be in the range 1 to infinity. The default number of estimators is 100”.

  • Maximum depth - According to (3.2.4.3.5. Sklearn.ensemble.GradientBoostingClassifier — Scikit-Learn 0.20.3 Documentation, 2009) it is “the maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. The default maximum depth is 3”.
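For concreteness, the defaults quoted above and one manually tuned variant (the tuned values are just an example, not our final choice) map to scikit-learn as:

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Defaults written out explicitly
rf = RandomForestClassifier(n_estimators=100, max_depth=None)   # 100 trees, unlimited depth
gb = GradientBoostingClassifier(n_estimators=100, max_depth=3)  # 100 stages, depth 3

# A manually tuned variant of the kind we will try later
rf_tuned = RandomForestClassifier(n_estimators=150, max_depth=10)
print(rf.n_estimators, rf.max_depth, gb.max_depth)
```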

2 Business Sector: Hotel Reservation Dataset

2.1 Hotel: Data Overview

The Hotel Reservations Dataset was taken from Kaggle (available from this link: https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset). This dataset contains information related to hotel bookings from July 2017 to December 2018. It consists of 36,275 observations, each representing a unique booking. The dataset covers 19 different attributes that provide insights into booking patterns, guest preferences, and hotel operations.

The meaning of each variable in this dataset is listed below:

Booking_ID: Unique identifier for each booking
no_of_adults: Number of adults included in the booking
no_of_children: Number of children included in the booking
no_of_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
no_of_week_nights: Number of weeknights (Monday to Friday) the guest stayed or booked to stay at the hotel
type_of_meal_plan: Type of meal plan booked by the guest
required_car_parking_space: Indicates whether the guest required a parking space (0 - No, 1 - Yes)
room_type_reserved: Type of room booked by the guest, ciphered (encoded) by INN Hotels
lead_time: Number of days between the booking date and the arrival date
arrival_year: Year of the guest’s arrival
arrival_month: Month of the guest’s arrival
arrival_date: Day of the month the guest arrived
market_segment_type: Segment to which the booking belongs, indicating the source or market type of the booking
repeated_guest: Indicates whether the guest is a repeated visitor (1 for repeated, 0 for new)
no_of_previous_cancellations: Number of previous bookings canceled by the guest
no_of_previous_bookings_not_canceled: Number of previous bookings not canceled by the guest
avg_price_per_room: Average price per room for the booking
no_of_special_requests: Number of special requests made by the guest (e.g. high floor, view from the room, etc.)
booking_status: Indicates if the booking was canceled or not

2.2 Hotel: Preprocessing Steps

We can see that Booking_ID (nunique=36275), type_of_meal_plan (nunique=4), room_type_reserved (nunique=7), market_segment_type (nunique=5), and the target variable booking_status (nunique=2) are all object variables. Therefore, for some of them we can apply Label Encoding to help our machine learning models in the next steps. Moreover, we will delete the first column, Booking_ID, because it has no importance for our analysis.

For now, we want to check if there are any missing values in this data set:

no_of_adults                            0
no_of_children                          0
no_of_weekend_nights                    0
no_of_week_nights                       0
type_of_meal_plan                       0
required_car_parking_space              0
room_type_reserved                      0
lead_time                               0
arrival_year                            0
arrival_month                           0
arrival_date                            0
market_segment_type                     0
repeated_guest                          0
no_of_previous_cancellations            0
no_of_previous_bookings_not_canceled    0
avg_price_per_room                      0
no_of_special_requests                  0
booking_status                          0
dtype: int64

From this, we can see that there are no missing values in this data set, so for now we are not going to remove anything else.

2.2.1 Label Encoding

Now, we want to use Label Encoding for the variables: type_of_meal_plan, room_type_reserved, market_segment_type and booking_status.

Unique Values of type_of_meal_plan:
[1 0 2 3]
Unique Values of room_type_reserved:
[1 4 2 6 5 7 3]
Unique Values of market_segment_type:
[0 1 2 3 4]
Unique Values of booking_status:
[0 1]

Now, all the variables in our dataset are numerical.

However, we have three variables (arrival_year, arrival_month, arrival_date) related to the date of arrival, so we want to merge them into a single date column and then drop the three original columns.
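A minimal pandas sketch of this merge (the two rows are illustrative values mirroring the dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "arrival_year":  [2017, 2018],
    "arrival_month": [10, 11],
    "arrival_date":  [2, 6],
})

# pd.to_datetime assembles a datetime from year/month/day columns
df["arrival_date_full"] = pd.to_datetime(
    df.rename(columns={"arrival_year": "year",
                       "arrival_month": "month",
                       "arrival_date": "day"})[["year", "month", "day"]])

# Drop the three now-redundant columns
df = df.drop(columns=["arrival_year", "arrival_month", "arrival_date"])
print(df["arrival_date_full"].tolist())
```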

no_of_adults no_of_children no_of_weekend_nights no_of_week_nights type_of_meal_plan required_car_parking_space room_type_reserved lead_time market_segment_type repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests booking_status arrival_date_full
0 2 0 1 2 1 0 1 224 0 0 0 0 65.00 0 0 2017-10-02
1 2 0 2 3 0 0 1 5 1 0 0 0 106.68 1 0 2018-11-06
2 1 0 2 1 1 0 1 1 1 0 0 0 60.00 0 1 2018-02-28
3 2 0 0 2 1 0 1 211 1 0 0 0 100.00 0 1 2018-05-20
4 2 0 1 1 0 0 1 48 1 0 0 0 94.50 0 1 2018-04-11

2.3 Hotel: Exploratory Data Analysis

2.3.1 Check whether the dataset is imbalanced

Now we want to see if our data is balanced or not:

Class Distribution:
0    24390
1    11885
Name: booking_status, dtype: int64

Class Proportions:
0    0.672364
1    0.327636
Name: booking_status, dtype: float64

Imbalance Ratio (Class 1 / Class 0): 0.487289872898729

According to this, with Class Proportions of 0 (Not cancelled) at 67% and 1 (Cancelled) at 33%, it appears that our dataset is not significantly imbalanced. Therefore, we can proceed with our analysis.
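The distribution check above can be reproduced with a few pandas calls (the series below simply re-creates the class counts reported):

```python
import pandas as pd

# Re-create the target column's class counts from the report
y = pd.Series([0] * 24390 + [1] * 11885, name="booking_status")

counts = y.value_counts()
proportions = y.value_counts(normalize=True)
imbalance_ratio = counts[1] / counts[0]

print(proportions.round(6).to_dict())
print("Imbalance Ratio (Class 1 / Class 0):", imbalance_ratio)
```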

2.3.2 Booking Status Over Time

Now, we want to use Plotly Express to visualize the booking status over time (for the years 2017 and 2018). We filter the dataset to isolate records for each respective year and then create line plots to display the trend of booking status over time:

The green line represents the Total Bookings, the blue line represents the Non-Cancelled Bookings, and the red line the Cancelled Bookings.

As we can see from this visualization, the number of non-cancelled bookings didn’t change much between 2017 and 2018, while the number of cancelled bookings did. The number of total bookings dropped quite significantly from October 2017 and started to recover from January 2018. During this period the number of cancelled bookings was nearly zero, which is why the lines for Total Bookings and Non-Cancelled Bookings almost overlap. Moreover, the number of cancelled bookings increased as the total number of bookings increased.

The blue line, or “Non-Cancelled Bookings”, can be considered “Net Bookings”, because these values are calculated by subtracting the number of Cancelled Bookings from the Total Bookings. We observe a significant drop in Net Bookings from 1611 on October 1, 2017, to 620 on November 1, 2017. Following this decline, Net Bookings began to gradually increase until March 2018. After March 2018, Net Bookings remained relatively stable for the subsequent months, so there don’t appear to be any anomalies.

2.3.3 Correlation Heatmap for the Hotel Dataset - numerical features

In this heatmap we can see that several variables are highly correlated with each other. For example, booking_status is moderately positively correlated with lead_time, which means that if a guest books for a later arrival day, the booking is more likely to be cancelled; conversely, if the guest books at the last minute, the booking is less likely to be cancelled. We also have room_type_reserved positively correlated with avg_price_per_room, which is intuitive: the better the room, the higher the price (meaning that Room 7 is considerably more premium than Room 1). We also have repeated_guest positively correlated with no_of_previous_bookings_not_canceled, which is also clear, since returning guests tend to have a higher number of previous bookings that were not canceled, indicating their satisfaction and loyalty. Moreover, guests who have cancelled before tend not to book again.

2.3.4 Boxplot of the numerical features for the Hotel Dataset

This next code shows boxplots to visualize the distribution of numerical variables in the hotel dataset.

After looking at the boxplots: most of them seem normal, but one stands out: the average price. Its boxplot has an outlier around 500. This comes from a canceled booking, which means the person didn’t stay at the hotel. We decide to keep it in our data because it’s important for the model to learn from all kinds of situations. Another number we find odd is the count of children in some bookings: we see some with 9 or 10 children. This seems strange for a hotel booking, so we decide to remove those rows from our data.

2.3.5 Histogram for the numerical features for the Hotel Reservations Dataset

In order to assess the skewness of the numerical features, we plot histograms for each of the variables. If a particular variable is skewed, we can apply a logarithmic transformation to make that feature closer to normally distributed.

From here, we can see that only the variable lead_time is (positively) skewed. Therefore we want to apply a log transformation to it to make it closer to normally distributed.

Now, we want to remove the lead_time variable from the data set and only use the log transformed one for our analysis.
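A sketch of the log transformation (we use log1p here as an assumption, since lead_time can be 0 and log(0) is undefined; the sample values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"lead_time": [224, 5, 1, 211, 48, 0]})  # illustrative values

# log1p(x) = log(1 + x) handles lead_time == 0 safely
df["lead_time_log"] = np.log1p(df["lead_time"])
df = df.drop(columns=["lead_time"])  # keep only the transformed feature
print(df["lead_time_log"].round(3).tolist())
```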

2.4 Hotel: Modeling

2.4.1 Modeling Summary

For this part of the project, we explore various machine learning models to predict hotel booking_status. We start by preprocessing the data, including splitting it into training and testing sets, with a 0.2 ratio, and standardizing the features using StandardScaler. Next, we use the SelectKBest method to identify the top features for modeling. Then, we train multiple classifiers including Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, and KNN classifiers. For each model, we evaluate its performance using metrics such as accuracy, precision, recall, F1-score, and ROC AUC score. Finally, we analyze the results to determine the best-performing model for predicting hotel booking statuses.
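A condensed sketch of this pipeline on synthetic data (two of the six models shown; the dataset, split, and feature counts are stand-ins for the real ones):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=15, random_state=0)

# 80/20 split, then standardize using statistics of the training set only
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

results = {}
for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("Random Forest", RandomForestClassifier(random_state=0))]:
    model.fit(X_tr_s, y_tr)
    pred = model.predict(X_te_s)
    results[name] = {
        "accuracy":  accuracy_score(y_te, pred),
        "precision": precision_score(y_te, pred),
        "recall":    recall_score(y_te, pred),
        "f1":        f1_score(y_te, pred),
        "roc_auc":   roc_auc_score(y_te, model.predict_proba(X_te_s)[:, 1]),
    }
print(results)
```

The actual run additionally evaluates SelectKBest variants and the remaining four classifiers in the same loop.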

Model Accuracy Precision Recall F1-score ROC AUC Score Computational Time
0 Logistic Regression Classifier 0.779738 0.683010 0.598214 0.637806 0.732515 0.200245
1 Logistic Regression Classifier Scaled 0.778360 0.676974 0.605017 0.638976 0.733265 0.056284
2 Logistic Regression Classifier with Feature Se... 0.714404 0.630354 0.287840 0.395213 0.603435 0.165056
3 Logistic Regression Classifier with Feature Se... 0.716885 0.623960 0.318878 0.422060 0.613345 0.045115
4 Decision Tree Classifier 0.856788 0.770499 0.795068 0.782591 0.840732 0.140496
5 Decision Tree Classifier Scaled 0.855961 0.770376 0.791667 0.780876 0.839235 0.143149
6 Decision Tree Classifier with Feature Selection 0.796416 0.686091 0.685799 0.685945 0.767640 0.112617
7 Decision Tree Classifier with Feature Selectio... 0.797932 0.689478 0.685374 0.687420 0.768651 0.109633
8 Random Forest Classifier 0.887939 0.848348 0.796769 0.821750 0.864222 3.957353
9 Random Forest Classifier Scaled 0.887388 0.847757 0.795493 0.820794 0.863482 3.951000
10 Random Forest Classifier with Feature Selection 0.805100 0.697723 0.703656 0.700677 0.778710 4.063125
11 Random Forest Classifier with Feature Selectio... 0.808408 0.707686 0.696854 0.702228 0.779388 4.118062
12 Gradient Boosting Classifier 0.847140 0.816285 0.681973 0.743109 0.804172 2.638402
13 Gradient Boosting Classifier Scaled 0.847140 0.816285 0.681973 0.743109 0.804172 2.698068
14 Gradient Boosting Classifier with Feature Sele... 0.775879 0.696855 0.546344 0.612488 0.716166 1.996979
15 Gradient Boosting Classifier with Feature Sele... 0.775879 0.696855 0.546344 0.612488 0.716166 1.980496
16 KNN Classifier 0.840524 0.764030 0.735119 0.749296 0.813103 0.481364
17 KNN Classifier Scaled 0.852653 0.779521 0.760629 0.769959 0.828714 1.568069
18 KNN Classifier with Feature Selection 0.790903 0.676682 0.679847 0.678261 0.762012 0.302915
19 KNN Classifier with Feature Selection Scaled 0.795589 0.692512 0.664541 0.678238 0.761497 0.721915
20 AdaBoost Classifier 0.807030 0.718951 0.664541 0.690676 0.769962 0.841016
21 AdaBoost Classifier Scaled 0.807030 0.718951 0.664541 0.690676 0.769962 0.845534
22 AdaBoost Classifier with Feature Selection 0.756720 0.674598 0.482143 0.562361 0.685289 0.726621
23 AdaBoost Classifier with Feature Selection Scaled 0.756720 0.674598 0.482143 0.562361 0.685289 0.680749

From this final table containing all the results from the models we used, we can observe that the Random Forest Classifier (which performs almost the same as its scaled version) demonstrates the best performance for this dataset. It achieved the highest scores across all measures: Accuracy (~89%), meaning the model correctly predicts whether a booking will be cancelled about 89% of the time; Precision (~85%), meaning that out of all the bookings the model predicted as cancelled, 85% were actually cancelled; Recall (~80%), meaning the model correctly identifies 80% of the bookings that were actually cancelled; F1-score (~82%), meaning the model achieves a good balance between precision and recall; and ROC AUC Score (~86%), which shows good discrimination between the positive and negative classes. Although it has a computational time of approximately 4 seconds, this is not considered significant given its superior performance.

2.4.2 Best Model Performance

For this dataset, the Random Forest Classifier stood out as the best performer across the different metrics: Accuracy, Precision, Recall, F1-score, ROC AUC Score, and Computational Time. So, we’re going to focus on this model for the next step of our project. We’ll perform tasks like Cross-Validation to see whether the results of the original version are reliable and consistent, and check for any signs that our model might be too focused on the training data (overfitting).

In this section, we’ll rerun the Random Forest Classifier model and compare it with its Cross-Validated version.

Model Accuracy Precision Recall F1-score ROC AUC Score Computational Time
0 Random Forest Classifier 0.883942 0.843324 0.788870 0.815189 0.859238 3.856658
1 Random Forest Classifier (CV) 0.883689 0.852845 0.780902 0.816014 0.940655 16.370657

As we can see, the difference between the Random Forest model and its cross-validated version is very small, so we can state that the results of the Random Forest Classifier are reliable enough to continue with our analysis.
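A hedged sketch of how the cross-validated scores in the table can be obtained (synthetic data; cv=5 is an assumption about the fold count):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=400, random_state=0)

# One call evaluates all five metrics across the folds
cv_results = cross_validate(
    RandomForestClassifier(random_state=0), X, y, cv=5,
    scoring=["accuracy", "precision", "recall", "f1", "roc_auc"])

for metric in ["accuracy", "precision", "recall", "f1", "roc_auc"]:
    print(metric, round(cv_results[f"test_{metric}"].mean(), 3))
```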

2.5 Hotel: Additional Techniques

2.5.1 Learning Curve

In the next code, we check the learning curve of the model as a function of the number of training samples (examples). We show three different learning curves: the first is the original model, the second has max_depth=10 and n_estimators=100, and the third has max_depth=30 and n_estimators=100. We want to compare how these three learning curves look and check whether these models might be overfitting.

Hotel Figure 1: Learning curve, through different samples

The learning curve in Hotel Figure 1 shows how well the model learns as we give it more examples to study. When we train it with 5000 examples, it gets everything right on the training data, showing it can memorize those examples perfectly. On the other hand, on the testing data it performs best when we use all 30,000+ samples from the dataset.

This insight helps us understand that the model is able to perform even better for the testing data if we added more new samples to the data set. Moreover, a more varied dataset can lead to better generalization of the model, enabling it to handle new data more effectively.

2.5.2 Checking for Overfitting

In this step, we want to check for overfitting with the Random Forest Classifier on this dataset using different settings: max depth ranging from 1 to 30, and the number of estimators set at 50, 100, and 150. We try out different setups to see how well the model works at different levels of complexity. This helps us find the best mix: a model expressive enough to learn the patterns without becoming too specific to certain cases.
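The sweep described above can be sketched as follows (synthetic data, a subset of depths, and 50 estimators; a widening train/test gap as depth grows is the overfitting signal we look for):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

gaps = {}
for depth in [1, 5, 10, 20, 30]:
    model = RandomForestClassifier(n_estimators=50, max_depth=depth,
                                   random_state=0).fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)
    test_acc = model.score(X_te, y_te)
    gaps[depth] = train_acc - test_acc  # a growing gap signals overfitting
    print(f"max_depth={depth:2d}: train={train_acc:.3f}  test={test_acc:.3f}")
```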

2.5.2.1 Random Forest Classifier with 50 estimators

Hotel Figure 3a: Random Forest Performance with 50 estimators and max depth from 1-30

2.5.2.2 Random Forest Classifier with 100 estimators

Hotel Figure 3b: Random Forest Performance with 100 estimators and max depth from 1-30

2.5.2.3 Random Forest Classifier with 150 estimators

Hotel Figure 3c: Random Forest Performance with 150 estimators and max depth from 1-30

When exploring different combinations of the Random Forest Classifier’s parameters—specifically, the maximum depth and the number of estimators—we analyzed the results depicted in Hotel Figures 3a, 3b, and 3c. We found that increasing the maximum depth generally improved the model’s performance on the training set across metrics such as accuracy, precision, recall, F1-score, and ROC AUC score. However, this improvement wasn’t as pronounced for the testing set.

The best maximum depth appeared to lie between 10 and 15. In this range, the differences between testing and training scores were smaller than for maximum depths between 15 and 30.

When considering the number of estimators, there wasn’t much difference between having 50, 100, or 150 estimators. It seems that the number of estimators didn’t have a significant impact on how well the model learned.

In conclusion, for this model, any number of estimators between 50 and 150 appears suitable. However, a maximum depth in the range of 10 to 15 seems to lead to the most balanced performance between the training and testing datasets.

2.6 Hotel: Key Findings

The dataset didn’t show severe class imbalance, with approximately 67% of bookings not being canceled and 33% being canceled.

Visualizing the booking status over time revealed interesting trends. While the number of non-canceled bookings remained relatively stable, the number of canceled bookings varied over time. There were significant drops in net bookings during certain periods, followed by gradual recovery.

The correlation heatmap showed several variables that were highly correlated with each other. For example, lead time was positively correlated with booking status, indicating that longer lead times were associated with a higher likelihood of booking cancellation.

Various machine learning models were evaluated, including Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, and KNN classifiers. Among these, the Random Forest Classifier demonstrated superior performance as the best-performing model for predicting hotel booking statuses. It achieved high scores across various metrics, including accuracy, precision, recall, F1-score, and ROC AUC score.

Learning curves and overfitting analysis were conducted to ensure the model’s generalization ability. The results indicated that a maximum depth in the range of 10 to 15 led to balanced performance between training and testing datasets.

3 Environment Sector: Weather in Australia Dataset

3.1 Weather: Data Overview

The Weather in Australia Dataset was taken from Kaggle (available from this link: https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package). This dataset contains 145,460 observations and 23 variables, 16 of which are numerical; the rest are categorical or of type date.

Below, we will find the table and the meaning of each of the variables in this dataset:

Column Name Meaning
Date The date of the observation
Location The common name of the location of the weather station
MinTemp The minimum temperature in degrees Celsius
MaxTemp The maximum temperature in degrees Celsius
Rainfall The amount of rainfall recorded for the day in mm
Evaporation The so-called Class A evaporation (mm) in the 24 hours to 9am
Sunshine The number of hours of bright sunshine in the day
WindGustDir The direction of the strongest wind gust in the 24 hours to midnight
WindGustSpeed The speed (km/h) of the strongest wind gust in the 24 hours to midnight
WindDir9am Direction of the wind at 9am
WindDir3pm Direction of the wind at 3pm
WindSpeed9am Speed of the wind 10 min prior to 9am (km/h)
WindSpeed3pm Speed of the wind 10 min prior to 3pm (km/h)
Humidity9am Humidity (percent) at 9am
Humidity3pm Humidity (percent) at 3pm
Pressure9am Atmospheric pressure at 9am
Pressure3pm Atmospheric pressure at 3pm
Cloud9am Fraction of the sky obscured by cloud at 9am (eighths)
Cloud3pm Fraction of the sky obscured by cloud at 3pm (eighths)
Temp9am Temperature at 9am (degree Celsius)
Temp3pm Temperature at 3pm (degree Celsius)
RainToday If today is rainy then ‘Yes’, if not then ‘No’
RainTomorrow Target Variable: If tomorrow is rainy then ‘Yes’, if not then ‘No’

3.2 Weather: Preprocessing Steps

Some of the variables are of type object, but first let us look at the missing values and decide whether, if there are any, we should delete them row-wise or column-wise.

From the heatmap we can see that the features with more than 50 percent of their observations missing are Evaporation, Sunshine, Cloud9am and Cloud3pm, and we decide to remove them from the dataset, since we believe these variables are not key factors in determining whether it is going to rain tomorrow. Moreover, since the focus of this project is not on implementing different imputation techniques to see how well they would perform across machine learning algorithms, we simply drop them.

Then we check whether there are any duplicate observations; since they would lead to biased results, we delete them.

After deleting those columns, we check how many missing values remain in each column and decide to delete row-wise. The target variable RainTomorrow has 3,267 missing observations, and these rows must be deleted: even if we had used imputation techniques elsewhere, we should never impute the target column, because that would bias the results.
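These three steps (dropping mostly-missing columns, dropping duplicates, and dropping rows with missing values, including rows with a missing target) can be sketched in pandas; the toy frame below only mimics a few of the weather columns:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the weather data (values are illustrative).
df = pd.DataFrame({
    "Sunshine":     [np.nan, np.nan, 5.0, np.nan],   # mostly missing
    "MinTemp":      [13.4, 7.4, 12.9, 9.2],
    "RainTomorrow": ["No", "Yes", np.nan, "No"],
})

# 1. Drop columns where more than 50% of the observations are missing.
df = df.loc[:, df.isna().mean() <= 0.5]

# 2. Drop exact duplicate rows, which would bias the results.
df = df.drop_duplicates()

# 3. Drop remaining rows with missing values; note the target
#    (RainTomorrow) is never imputed, only dropped.
df = df.dropna()
print(df)
```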

After these preprocessing steps, our dataset has 112,925 rows and 19 features, which is not a big loss of information.

3.3 Weather: Exploratory Data Analysis

3.3.1 Check whether the dataset is imbalanced

Now we look at the target variable RainTomorrow, which takes the value 1 when it will rain and 0 when it won’t, and check whether there is an imbalance between the two classes.

We can see that the class 0 has 77.84% of the total observations in the dataset and class 1 holds 22.16% of the total observations.

From this we can say that the dataset is moderately imbalanced (roughly 78:22), though not severely so, and we proceed without resampling.

3.3.2 Correlation Heatmap for the numerical features

After the preprocessing steps, we want to see how correlated the numerical variables in the dataset are with each other.

We can see that variables recording the same quantity at different times of day are strongly correlated with each other: if the temperature is high at 9AM, it is expected to be high at 3PM as well, and vice versa.

3.3.3 Boxplot of the numerical features

Next, we look more closely at the distribution of the numerical features and check whether there are any outliers that might affect the overall prediction task.

For rainfall, which is measured in millimetres, there are many outliers, but even the most extreme one (around 300 mm) could correspond to a real day with that much rain in a specific area. For wind speed, there are outliers up to about 80 km/h, which is also plausible. Looking at each of the variables, the outliers make sense, and therefore we do not want to remove them.

3.3.4 Histogram for the numerical features

In order to see the skewness of the numerical features we need to plot histograms for each of the variables.

We can see that most of the numerical features are approximately normally distributed; only the three plots in the second row are slightly left-skewed.

3.4 Weather: Modelling

3.4.1 Modeling Summary

Before applying the six classification algorithms, we create several dataframes: first, the dataset with all variables; second, the same dataset scaled to mean zero and standard deviation one; third, a dataset restricted to the ten best features chosen by the SelectKBest algorithm from the 19 features remaining after preprocessing, namely ‘MaxTemp’, ‘Rainfall’, ‘WindGustSpeed’, ‘WindSpeed3pm’, ‘Humidity9am’, ‘Humidity3pm’, ‘Pressure9am’, ‘Pressure3pm’, ‘Temp3pm’, and ‘RainToday’; and fourth, the same ten-feature dataset scaled to mean zero and standard deviation one. We then compare the performance metrics of the six classification algorithms on each of these datasets.
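The feature-selection step can be sketched with scikit-learn's `SelectKBest`. The ANOVA F-score (`f_classif`) used as the scoring function here is an assumption, and the synthetic features merely stand in for the 19 weather variables:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in; the real run selects 10 of the 19 weather variables.
X, y = make_classification(n_samples=500, n_features=19, n_informative=5,
                           random_state=0)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(19)])

# Score each feature against the target and keep the k best.
selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
best = X.columns[selector.get_support()].tolist()
print(best)
```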

Moving into the modelling part, we first split the data into training and testing sets with a 0.2 test ratio. The random state is fixed, since we do not want different results each time the code runs, which would be reported inconsistently in the project report.
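A minimal sketch of the split-and-scale step, assuming a `StandardScaler` for the mean-0/std-1 scaling (synthetic data stands in for the weather features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=19, random_state=0)

# 0.2 test ratio; a fixed random_state keeps the report reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then apply it to both splits,
# so no information from the test set leaks into training.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)
print(X_tr_s.mean(axis=0).round(6)[:3], X_tr_s.std(axis=0).round(6)[:3])
```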

Model Accuracy Precision Recall F1-score ROC AUC Score Computational Time
0 Logistic Regression Classifier 0.844454 0.732572 0.481723 0.581237 0.715468 0.892187
1 Logistic Regression Classifier Scaled 0.846757 0.732829 0.497530 0.592680 0.722572 0.282204
2 Logistic Regression Classifier with Feature Selection 0.847288 0.739596 0.491602 0.590623 0.720807 0.802855
3 Logistic Regression Classifier with Feature Selection Scaled 0.848085 0.737471 0.500099 0.596020 0.724342 0.286957
4 Decision Tree Classifier 0.789329 0.529062 0.544952 0.536889 0.702429 1.421627
5 Decision Tree Classifier Scaled 0.788311 0.526985 0.540209 0.533515 0.700086 1.532786
6 Decision Tree Classifier with Feature Selection 0.783662 0.516824 0.531120 0.523874 0.693858 0.979467
7 Decision Tree Classifier with Feature Selection Scaled 0.784414 0.518363 0.535467 0.526776 0.695889 0.952178
8 Random Forest Classifier 0.856940 0.771675 0.513535 0.616681 0.734826 24.751709
9 Random Forest Classifier Scaled 0.858446 0.775089 0.518870 0.621612 0.737693 24.805765
10 Random Forest Classifier with Feature Selection 0.853133 0.759833 0.503853 0.605917 0.728929 19.058738
11 Random Forest Classifier with Feature Selection Scaled 0.852557 0.755234 0.506026 0.606010 0.729331 19.643665
12 Gradient Boosting Classifier 0.852956 0.755732 0.508002 0.607586 0.730291 21.357315
13 Gradient Boosting Classifier Scaled 0.852646 0.753437 0.508990 0.607547 0.730442 21.533165
14 Gradient Boosting Classifier with Feature Selection 0.849369 0.745052 0.498320 0.597206 0.724537 13.892945
15 Gradient Boosting Classifier with Feature Selection Scaled 0.849369 0.743041 0.501087 0.598537 0.725521 13.645543
16 KNN Classifier 0.844100 0.708559 0.516894 0.597738 0.727746 3.558234
17 KNN Classifier Scaled 0.835865 0.684872 0.495554 0.575032 0.714851 3.281919
18 KNN Classifier with Feature Selection 0.838743 0.687253 0.514523 0.588475 0.723451 4.266080
19 KNN Classifier with Feature Selection Scaled 0.838831 0.686924 0.515906 0.589258 0.724000 5.603025
20 AdaBoost Classifier 0.846580 0.735676 0.492195 0.589795 0.720561 5.064919
21 AdaBoost Classifier Scaled 0.846402 0.735642 0.491010 0.588932 0.720025 5.170863
22 AdaBoost Classifier with Feature Selection 0.845074 0.736667 0.480340 0.581509 0.715375 3.629167
23 AdaBoost Classifier with Feature Selection Scaled 0.845163 0.737401 0.479945 0.581448 0.715292 3.525300

After running the six classification machine learning algorithms on the 24 model/dataset combinations, we can see that the best-performing model, comparing accuracy, precision, recall, F1-score, ROC AUC score, and computational time, is the Random Forest Classifier with scaled data (mean 0 and standard deviation 1).

  • Accuracy: In this case, it is 85.84%, which means that the model correctly predicts whether it will rain or not about 85.84% of the time. In the context of predicting weather conditions, accuracy is crucial as it directly reflects the model’s ability to provide reliable forecasts, which is valuable for making informed decisions, planning activities, and managing resources effectively.

  • Precision: A precision of 77.51% means that out of all the instances the model predicted as rain, 77.51% of them were actually rain.

  • Recall: Also known as sensitivity; a score of 51.89% means that the model correctly identifies 51.89% of the actual instances of rain.

  • F1-score: In this case, the F1-score is 62.16%. A higher F1-score indicates better model performance. An F1-score of 62.16% suggests that the model has achieved a fair balance between precision and recall.

  • ROC AUC Score: The ROC AUC score of 73.77% indicates reasonably good discrimination between the positive and negative classes.

  • Computational Time: It took approximately 24.81 seconds for the model to train and make predictions.
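As a reminder of how these metrics are computed, here is a small sketch on hypothetical labels and probabilities (not the report's actual predictions):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Hypothetical labels and probabilities, just to show each metric's inputs.
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 0, 1, 1, 0]
y_prob = [0.1, 0.2, 0.9, 0.4, 0.3, 0.8, 0.6, 0.2]

print("accuracy :", accuracy_score(y_true, y_pred))   # correct / total
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("f1       :", f1_score(y_true, y_pred))         # harmonic mean of P, R
# ROC AUC is computed from scores/probabilities, not hard labels.
print("roc auc  :", roc_auc_score(y_true, y_prob))
```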

3.4.2 Best Model Performance

Now we are curious how Random Forest with scaled data performs with and without cross-validation: will the results be consistent, or will there be differences?

Model Accuracy Precision Recall F1-score ROC AUC Score Computational Time
0 Random Forest Classifier 0.857118 0.770183 0.516499 0.618332 0.735994 29.336895
1 Random Forest Classifier Scaled 0.858313 0.774727 0.518475 0.621212 0.737467 28.821689
2 Random Forest Classifier with Feature Selection 0.852956 0.754982 0.508990 0.608049 0.730642 22.569157
3 Random Forest Classifier with Feature Selection Scaled 0.852159 0.754734 0.504051 0.604431 0.728372 21.172029
4 Random Forest Classifier Scaled with Cross Validation 0.856830 0.761068 0.513028 0.612894 0.885441 117.668163

Overall, from the table above we can see that the models have similar accuracy, precision, recall, and F1-score; however, the cross-validated model demonstrates superior discrimination between classes, as evidenced by its higher ROC AUC score. This improvement comes at the cost of increased computational time.

When the results of a model with and without cross-validation are almost the same, it means the model is consistent and doesn’t rely heavily on how the data is split for validation. This suggests the model is stable and can generalize well to new data.
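One simple way to check this consistency is to look at the per-fold scores directly: similar scores across folds (a low standard deviation) suggest the model does not depend heavily on a particular split. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data for the weather features.
X, y = make_classification(n_samples=800, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0)

# 5-fold cross-validation: one accuracy score per fold.
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print("fold scores:", scores.round(3),
      "mean:", scores.mean().round(3),
      "std:", scores.std().round(3))
```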

3.5 Weather: Additional Techniques

3.5.1 Learning Curve

Now, for the best-performing model, we want to see its learning curve.

Weather Figure 1: Learning curve, through different samples

The learning curve plot, as shown in Weather Figure 1, reveals that the model achieves a perfect training score when trained on 10,000 samples, indicating its ability to memorize the training data entirely. However, the largest improvement in testing performance occurs when moving up to 30,000 samples; beyond this point, additional samples yield minimal gains. When the training accuracy remains relatively stable while the testing accuracy improves slightly with more samples, it indicates that the model is learning to generalize better as more data is provided.

3.5.2 Checking for Overfitting

Now we want to check overfitting with the Random Forest Classifier Scaled Dataset using different settings: max depth ranging from 1 to 20, and the number of estimators set at 50, 100, and 150. By testing various configurations, we aim to understand how the model’s performance changes with different complexities. This helps us identify the optimal balance between model complexity and generalization ability.
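This sweep can be sketched with scikit-learn's `validation_curve` helper, which evaluates train and validation scores across a parameter range; the synthetic data and the reduced depth grid below are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic stand-in data for the scaled weather features.
X, y = make_classification(n_samples=600, n_features=10, random_state=1)

# Train/validation accuracy across a max_depth sweep (subset of 1-20).
depths = [1, 4, 8, 12, 20]
train_scores, test_scores = validation_curve(
    RandomForestClassifier(n_estimators=50, random_state=1), X, y,
    param_name="max_depth", param_range=depths, cv=3, scoring="accuracy")

for d, tr, te in zip(depths, train_scores.mean(axis=1),
                     test_scores.mean(axis=1)):
    print(f"max_depth={d:2d}  train={tr:.3f}  test={te:.3f}")
```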

3.5.2.1 Random Forest Classifier with 50 estimators

Weather Figure 2a: Random Forest Performance with 50 estimators and max depth from 1-20

3.5.2.2 Random Forest Classifier with 100 estimators

Weather Figure 2b: Random Forest Performance with 100 estimators and max depth from 1-20

3.5.2.3 Random Forest Classifier with 150 estimators

Weather Figure 2c: Random Forest Performance with 150 estimators and max depth from 1-20

When exploring different combinations of max depth and number of estimators for the Random Forest Classifier, we observed from Weather Figures 2a, 2b, and 2c that increasing the max depth generally led to improved performance metrics on the training set, including accuracy, precision, recall, F1-score, and ROC AUC score. However, the performance on the testing dataset showed fluctuations, with some max depths performing better than others. From the plots, it’s evident that the training scores consistently improve with increasing max depth, but the testing scores fluctuate, indicating potential overfitting.

The optimal max depth appears to be in the range of 6 to 8, where the differences in performance metrics between different depths are minimal, suggesting a balance between model complexity and generalization. This range offers good performance on both the training and testing datasets while reducing the risk of overfitting.

Interestingly, varying the number of estimators (50, 100, and 150) in the Random Forest did not significantly impact the shape or trend of the learning curves. Despite differences in the number of trees in the forest, the overall behavior of the model, as reflected in the learning curves, remained consistent. This suggests that increasing the number of estimators beyond a certain point may not lead to substantial improvements in model performance. Therefore, it is important to consider the trade-off between computational cost and performance when selecting the number of estimators.

3.6 Weather: Key Findings

The Random Forest Classifier with 100 estimators and a maximum depth of 6 exhibits optimal performance for this dataset. Increasing the number of estimators beyond 100 does not significantly enhance model performance, suggesting diminishing returns. Effective preprocessing steps, including missing value handling and feature selection, contribute to improved model interpretability and performance. Additionally, the moderate class imbalance (roughly 78:22) did not prevent robust model training and evaluation.

4 Health Sector: Cardiovascular Dataset

4.1 Cardiovascular: Data Overview

The Cardiovascular Dataset was taken from Kaggle (available from this link: https://www.kaggle.com/datasets/alphiree/cardiovascular-diseases-risk-prediction-dataset). The task involves examining a healthcare dataset with the goal of predicting the factors behind various diseases, particularly heart disease. This dataset has 308,854 observations and 19 features, including lifestyle factors, personal details, habits, and the presence of different diseases.

In this dataset, there are 12 categorical variables and 7 numerical variables. The feature names and their descriptions are listed below.

Column Name Description
General_Health Indicates the general health status of the individual, categorized as ‘Poor’, ‘Very Good’, ‘Good’, ‘Fair’, or ‘Excellent’.
Checkup Indicates the frequency of medical checkups, with options such as ‘Within the past 2 years’, ‘Within the past year’, ‘5 or more years ago’, ‘Within the past 5 years’, or ‘Never’.
Exercise Indicates whether the individual engages in regular exercise, with options ‘Yes’ or ‘No’.
Heart_Disease Indicates the presence or absence of heart disease, with options ‘Yes’ or ‘No’.
Skin_Cancer Indicates the presence or absence of skin cancer, with options ‘Yes’ or ‘No’.
Other_Cancer Indicates the presence or absence of other types of cancer, with options ‘Yes’ or ‘No’.
Depression Indicates whether the individual suffers from depression, with options ‘Yes’ or ‘No’.
Diabetes Indicates the presence or absence of diabetes, with options including ‘Yes’ or ‘No’.
Arthritis Indicates the presence or absence of arthritis, with options ‘Yes’ or ‘No’.
Sex Indicates the gender of the individual, with options ‘Female’ or ‘Male’.
Age_Category Indicates the age category of the individual, such as ‘70-74’, ‘60-64’, ‘75-79’, ‘80+’, etc.
Height_(cm) Indicates the height of the individual in centimeters.
Weight_(kg) Indicates the weight of the individual in kilograms.
BMI Indicates the Body Mass Index (BMI) of the individual.
Smoking_History Indicates the smoking history of the individual, with options ‘Yes’ or ‘No’.
Alcohol_Consumption Indicates the frequency of alcohol consumption, measured in units.
Fruit_Consumption Indicates the frequency of fruit consumption per week, measured in servings.
Green_Vegetables_Consumption Indicates the frequency of green vegetables consumption per week, measured in servings.
FriedPotato_Consumption Indicates the frequency of fried potato consumption per week, measured in servings.

4.2 Cardiovascular: Exploratory Data Analysis

4.2.1 A Series of Boxplots

For the exploratory data analysis, we start by generating a series of boxplots for the numerical columns in the “cardio” dataset, representing the distribution of various health and lifestyle variables. These visualizations are helpful for understanding the spread and central tendencies of the data.

The provided data exhibits unusual extremes for height, weight, and BMI, with maximum values of 241 cm, 293 kg, and 99.33 respectively, as well as minimum values of 91 cm and 24 kg. Given that this data was collected from adults, such extremes are uncommon and likely represent outliers. These outliers should be removed during the data cleaning process to ensure the dataset’s integrity for analysis.

4.2.2 A Collection of Histograms

This visualization shows a collection of histograms, each representing the distribution of each numerical variable.

  • Height (cm): The data appears to be normally distributed, centered around a mean value which looks to be approximately 170 cm.

  • Weight (kg): Similar to the height distribution, it appears roughly normally distributed, with most values falling between 60 and 100 kg.

  • BMI: Most people in this dataset have a BMI between 25 and 30, which is categorized as overweight. However, a significant number of people fall under the normal (18 - 25) and obese (30 - 35) categories.

  • Alcohol Consumption: Most people in this dataset consume little or no alcohol. The alcohol consumption distribution is heavily right-skewed.

  • Fruit Consumption: This fruit consumption graph shows irregular patterns with multiple peaks, indicating variability in people’s diets.

  • Green Vegetables Consumption: Similar to fruit consumption, this variable also appears to be multimodal. There are peaks at the lower end of the scale, indicating that a portion of the population consumes green vegetables infrequently.

  • Fried Potato Consumption: Similar to fruit and green vegetables consumption, this histogram is not normal and is skewed to the right, with a large number of individuals consuming fried potatoes infrequently, and a few consuming them very frequently.

4.2.3 Target Variable: Heart_Disease

Next, we generated a histogram to show a comparison of individuals with and without heart disease.

As the results show, the “No” bar is significantly higher than the “Yes” bar (283,883 versus 24,971), indicating that a much larger number of individuals in the sample do not have heart disease.

4.2.4 Correlation Matrix

After that, we copied the original data into a new dataframe and converted the categorical columns into numerical variables in order to compute a correlation matrix.

General_Health Checkup Exercise Heart_Disease Skin_Cancer Other_Cancer Depression Diabetes Arthritis Sex Age_Category Height_(cm) Weight_(kg) BMI Smoking_History Alcohol_Consumption Fruit_Consumption Green_Vegetables_Consumption FriedPotato_Consumption
0 3 2 0 0 0 0 0 0 1 0 10 150.0 32.66 14.54 1 0.0 30.0 16.0 12.0
1 4 4 0 1 0 0 0 2 0 0 10 165.0 77.11 28.29 0 0.0 30.0 0.0 4.0
2 4 4 1 0 0 0 0 2 0 0 8 163.0 88.45 33.47 0 4.0 12.0 3.0 16.0
3 3 4 1 1 0 0 0 2 0 1 11 180.0 93.44 28.73 0 0.0 30.0 30.0 8.0
4 2 4 0 0 0 0 0 0 0 1 12 191.0 88.45 24.37 1 0.0 8.0 4.0 0.0

This heatmap is useful for quickly identifying potential relationships between health-related factors.
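The encode-then-correlate step can be sketched in pandas; the toy values below are illustrative, and integer category codes are one simple stand-in for the label encoding used:

```python
import pandas as pd

# Toy frame standing in for "cardio" (column names from the dataset,
# values purely illustrative).
cardio = pd.DataFrame({
    "General_Health": ["Poor", "Very Good", "Good", "Poor"],
    "Exercise":       ["No", "Yes", "Yes", "No"],
    "Heart_Disease":  ["No", "Yes", "No", "Yes"],
})

# Encode each categorical column as integer codes on a copy.
cardio_encoded = cardio.copy()
for col in cardio_encoded.select_dtypes("object"):
    cardio_encoded[col] = cardio_encoded[col].astype("category").cat.codes

# Correlation of every feature with the target, ready for a bar chart.
corr = cardio_encoded.corr()["Heart_Disease"].drop("Heart_Disease")
print(corr)
```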

As our target variable is the presence of heart disease (Heart_Disease), we plot the bar chart to show the correlation coefficients of various factors with heart disease.

As shown, the factor most strongly correlated with heart disease is “Age_Category (0.23)”, followed by “Diabetes (0.17)”, “Arthritis (0.15)”, and “Smoking_History (0.11)”. Other factors like “Sex (0.07)”, “BMI (0.04)”, “Depression (0.03)”, and dietary habits have lower correlations. “Exercise (-0.10)” and “Alcohol Consumption (-0.04)” have the least correlation. Essentially, the chart identifies age, diabetes, arthritis, and smoking as having stronger associations with heart disease in the studied population.

4.3 Cardiovascular: Preprocessing Steps

4.3.1 Identifying and Handling Outliers

Before identifying and handling outliers, we started by checking for missing data and duplicate values. There are no missing values in any column of the dataset; however, we found 80 duplicated rows, which were subsequently removed. Duplicate entries can occur when a person enters repeated values, intentionally or accidentally. After this, we checked the number of unique values in each column.

As the unusual extremes in the earlier boxplots showed, we remove outliers from the Height, Weight, and BMI attributes in this step. However, we retain the outliers in the Alcohol, Fruit, Green Vegetables, and Fried Potato consumption attributes, since it is uncertain whether those values are errors.

To mitigate the influence of extreme cases on the results, 1,955 rows containing outliers were excluded, aiming to prevent potential inaccuracies. As a result, our dataset still contains 306,899 observations.
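One common way to exclude such outliers is the 1.5×IQR rule sketched below; whether the report used exactly this rule is an assumption, and the heights are toy values:

```python
import pandas as pd

# Illustrative heights; 241 cm and 91 cm are the kind of extremes
# flagged in the boxplots.
df = pd.DataFrame({"Height_(cm)": [160, 165, 170, 172, 175, 180, 241, 91]})

# Keep rows within 1.5*IQR of the quartiles (one common outlier rule).
q1, q3 = df["Height_(cm)"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["Height_(cm)"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_clean = df[mask]
print(len(df), "->", len(df_clean))
```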

4.3.2 Training and Testing Split

Before the training and testing split, we performed label encoding on a copy of the original dataframe named “cardio_encoded”. After the label encoding process, we checked the data types of the variables in the “cardio_encoded” DataFrame to ensure that no string-format variables remained.

Then, we split the dataset into training and testing sets and standardized the feature variables, preparing the data for machine learning modeling. Finally, the scaled data is converted back into DataFrame format for further analysis.

4.3.3 Feature Selection: KBest

This step involves feature selection using the SelectKBest method with k = 10. The target variable is set to ‘Heart_Disease’. The selected features are ‘Checkup’, ‘Exercise’, ‘Heart_Disease’, ‘Skin_Cancer’, ‘Depression’, ‘Diabetes’, ‘Arthritis’, ‘Sex’, ‘Height_(cm)’, and ‘BMI’. Finally, the selected features are applied to both the training and testing sets, as well as their scaled versions, by dropping the non-selected features from the datasets.

4.3.4 Improving Class Imbalance by Resampling (Undersampling)

In this step, we calculated the imbalance ratio of the target variable “Heart_Disease” in the dataset.

Cardio Figure 1: Imbalanced Class of Heart Disease

As shown in Cardio Figure 1, this ratio highlights a clear class imbalance: individuals with heart disease are underrepresented compared to those without.

Then, we addressed the class imbalance through undersampling of the majority class. First, the dataset is divided into a majority class (no heart disease) and a minority class (heart disease present). Then, the majority class is randomly downsampled to reach an 80:20 ratio between the majority and minority classes.

Now, the majority class 0 (not having heart disease) and the minority class 1 (having heart disease) have 99,204 and 24,801 observations, respectively, as you can see in the Cardio Figure 2.

Cardio Figure 2: Undersampling

After undersampling, the dataset is split into training and testing sets, followed by standardization of the features. Finally, the scaled data is converted back to DataFrames for further analysis.
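The undersampling step can be sketched with `sklearn.utils.resample`; the toy target below mimics the Heart_Disease imbalance, and the 4:1 sample count implements the 80:20 ratio:

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced target standing in for Heart_Disease (0 = no disease).
df = pd.DataFrame({"Heart_Disease": [0] * 900 + [1] * 100,
                   "BMI": range(1000)})

majority = df[df["Heart_Disease"] == 0]
minority = df[df["Heart_Disease"] == 1]

# Downsample the majority class to an 80:20 majority:minority ratio.
majority_down = resample(majority, replace=False,
                         n_samples=4 * len(minority), random_state=42)
balanced = pd.concat([majority_down, minority])
print(balanced["Heart_Disease"].value_counts().to_dict())
```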

4.4 Cardiovascular: Modeling

4.4.1 Modeling Summary

Model Accuracy Precision Recall F1-score ROC AUC Score Computational Time
0 Logistic Regression with All Variables 0.918899 0.422727 0.018811 0.036019 0.508280 1.467543
1 Logistic Regression with Scaled Data 0.918964 0.447552 0.025890 0.048948 0.511545 0.302805
2 Logistic Regression with Feature Selection 0.919029 0.400000 0.010518 0.020497 0.504568 1.168535
3 Logistic Regression with Scaled Feature Selection 0.919094 0.412698 0.010518 0.020513 0.504603 0.199655
4 Logistic Regression with Resampling 0.814080 0.580083 0.254839 0.354111 0.604361 0.671731
5 Logistic Regression with Scaled Resampling 0.814282 0.579443 0.260282 0.359210 0.606528 0.107246
6 Decision Tree with All Variables 0.862447 0.195792 0.227751 0.210566 0.572900 1.181523
7 Decision Tree with Scaled Data 0.862447 0.195792 0.227751 0.210566 0.572900 1.160869
8 Decision Tree with Feature Selection 0.900749 0.210977 0.084749 0.120924 0.528492 0.386444
9 Decision Tree with Scaled Feature Selection 0.900929 0.211568 0.084345 0.120607 0.528405 0.392200
10 Decision Tree with Resampling 0.749083 0.382161 0.412903 0.396938 0.623013 0.465445
11 Decision Tree with Scaled Resampling 0.749163 0.382039 0.411694 0.396312 0.622610 0.459432
12 Random Forest with All Variables 0.918638 0.434896 0.033778 0.062688 0.514967 18.796389
13 Random Forest with Scaled Data 0.918638 0.430556 0.031351 0.058446 0.513859 19.802505
14 Random Forest with Feature Selection 0.903486 0.220639 0.078277 0.115557 0.527027 12.010410
15 Random Forest with Scaled Feature Selection 0.903698 0.217747 0.075445 0.112062 0.525851 11.881666
16 Random Forest with Resampling 0.818394 0.589272 0.303427 0.400586 0.625279 7.939783
17 Random Forest with Scaled Resampling 0.817265 0.581930 0.306452 0.401479 0.625707 7.918733
18 Gradient Boosting with All Variables 0.919436 0.498660 0.037621 0.069964 0.517154 19.082682
19 Gradient Boosting with Scaled Data 0.919436 0.498660 0.037621 0.069964 0.517154 18.692237
20 Gradient Boosting with Feature Selection 0.919306 0.285714 0.001214 0.002417 0.500474 8.275284
21 Gradient Boosting with Scaled Feature Selection 0.919306 0.285714 0.001214 0.002417 0.500474 8.293972
22 Gradient Boosting with Resampling 0.823757 0.607130 0.336492 0.433000 0.641030 7.615644
23 Gradient Boosting with Scaled Resampling 0.823757 0.607130 0.336492 0.433000 0.641030 7.552372
24 KNeighbors with All Variables 0.911470 0.201946 0.033576 0.057579 0.510976 7.912075
25 KNeighbors with Scaled Data 0.909433 0.296761 0.090817 0.139074 0.535982 7.341052
26 KNeighbors with Feature Selection 0.911258 0.243629 0.048341 0.080675 0.517597 1.523488
27 KNeighbors with Scaled Feature Selection 0.911030 0.251681 0.052994 0.087552 0.519595 4.721148
28 KNeighbor with Resampling 0.777509 0.390416 0.200403 0.264855 0.561091 1.297843
29 AdaBoost with All Variables 0.918785 0.466882 0.058455 0.103901 0.526304 4.393100
30 AdaBoost with Scaled Data 0.918785 0.466882 0.058455 0.103901 0.526304 4.312562
31 AdaBoost with Feature Selection 0.919225 0.455696 0.014563 0.028224 0.506520 2.403135
32 AdaBoost with Scaled Feature Selection 0.919225 0.455696 0.014563 0.028224 0.506520 2.410494
33 AdaBoost with Resampling 0.822628 0.604469 0.327218 0.424591 0.636846 1.765493
34 AdaBoost with Scaled Resampling 0.822628 0.604469 0.327218 0.424591 0.636846 1.767104

Based on the results of the model performance, we can summarize the best model for each metric as follows:

  • Best Accuracy: The highest accuracy is achieved by both Gradient Boosting models with all variables and scaled data. This means the models are correct in their predictions about 91.94% of the time.

  • Best Precision: The models with the highest precision are Gradient Boosting with Resampling and Gradient Boosting with Scaled Resampling. This indicates that when these models predict heart disease, it is correct 60.71% of the time.

  • Best Recall: The model with the highest recall is Decision Tree with Resampling, meaning this model correctly identifies about 41.29% of all true cases of heart disease.

  • Best F1-score: Gradient Boosting with Resampling and Gradient Boosting with Scaled Resampling have the highest F1-scores (43.30%), suggesting they strike the best balance between precision and recall: neither missing too many real cases (high recall) nor producing too many false positives (high precision).

  • Best ROC AUC Score: Gradient Boosting with Resampling and Gradient Boosting with Scaled Resampling achieve the highest ROC AUC scores (64.10%), indicating their strong ability to distinguish between patients with and without heart disease.

  • Best Computational Time: Logistic Regression with Scaled Resampling is the fastest to make its predictions, making it the best choice when quick predictions are needed.

For heart disease prediction, it is essential to select a model that not only has high accuracy but also a strong ability to correctly identify as many actual cases as possible (high recall) and correctly predict heart disease when it is truly present (high precision). Additionally, the ability to distinguish between the classes (high ROC AUC) and a good balance between precision and recall (high F1-score) are particularly important.
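As an illustration, all of the metrics discussed above can be computed with scikit-learn's metric functions. The labels and probabilities below are invented purely for demonstration; they are not taken from the cardiovascular dataset.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [0, 0, 0, 0, 1, 1, 1, 1]                   # actual heart-disease labels
y_pred = [0, 0, 0, 1, 0, 1, 1, 1]                   # hard class predictions
y_prob = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]   # predicted probabilities

print(accuracy_score(y_true, y_pred))   # fraction of correct predictions -> 0.75
print(precision_score(y_true, y_pred))  # TP / (TP + FP) -> 0.75
print(recall_score(y_true, y_pred))     # TP / (TP + FN) -> 0.75
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_prob))    # ranking quality of the probabilities
```

Note that ROC AUC is computed from the predicted probabilities rather than the hard class labels, which is why a model can have mediocre recall yet still rank patients well.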

Considering the criticality of all these metrics in a healthcare context, the Gradient Boosting with Resampling and Gradient Boosting with Scaled Resampling models are the best choices. They not only provide the highest precision and F1-scores, indicating a robust balance between precision and recall, but also the highest ROC AUC scores, demonstrating excellent discriminative ability. However, we will select only Gradient Boosting with Resampling for further analysis, since scaling the resampled data did not significantly change the model's effectiveness compared to resampling alone.
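The resampling variants address the strong class imbalance in the data (roughly 8% positive cases). Since the exact resampling technique is not restated in this section, the sketch below shows one common option, random oversampling of the minority class with `sklearn.utils.resample`, on synthetic data; in practice the resampling is applied to the training split only, to avoid leaking test information.

```python
# Assumed example: random oversampling of the minority class.
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 92 + [1] * 8)   # ~8% positives, mimicking the class imbalance

X_maj, X_min = X[y == 0], X[y == 1]
# Draw minority samples with replacement until both classes are the same size.
X_min_up, y_min_up = resample(X_min, y[y == 1],
                              replace=True, n_samples=len(X_maj),
                              random_state=42)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y[y == 0], y_min_up])
print(np.bincount(y_bal))   # both classes now have 92 samples
```

Oversampling raises recall at the cost of accuracy, which matches the pattern in the table above: the resampled models lose about 10 points of accuracy but gain substantially in recall and F1-score.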

4.4.2 Best Model Performance

Now, we want to see how the Gradient Boosting models with all variables and with resampled data perform with and without cross-validation. Are the results consistent, or do they vary?

Model Accuracy Precision Recall F1-score ROC AUC Score Computational Time
0 Gradient Boosting with All Variables 0.919436 0.498660 0.037621 0.069964 0.517154 19.295390
1 Gradient Boosting with All Variables (CV) 0.919611 0.536702 0.047187 0.086730 0.836876 74.838946
2 Gradient Boosting with Resampling 0.823757 0.607130 0.336492 0.433000 0.641030 8.050771
3 Gradient Boosting with Resampling (CV) 0.822699 0.604184 0.329218 0.426168 0.835305 31.133341

Overall, both Gradient Boosting models demonstrate consistent performance between the original and cross-validated versions. The cross-validated runs yield slightly higher precision, recall, and F1-score for the all-variables model and substantially higher ROC AUC scores for both models, indicating better generalization and robustness. Nevertheless, this improvement comes at the cost of increased computational time.

When a model produces similar results with and without cross-validation, it indicates that the model’s performance is consistent and not heavily influenced by how the data is split for validation. This suggests that the model is robust and capable of generalizing effectively to unseen data.
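A minimal sketch of how such cross-validated scores can be obtained with `cross_validate` is shown below. The synthetic dataset, the 3-fold setting, and the class weighting are assumptions chosen to keep the example small; the report's actual fold count is not restated here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate

# Imbalanced synthetic data standing in for the cardiovascular dataset.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

model = GradientBoostingClassifier(random_state=0)
scores = cross_validate(model, X, y, cv=3,
                        scoring=["accuracy", "precision", "recall",
                                 "f1", "roc_auc"])

# Each entry holds one score per fold; averaging gives the reported metric.
for name in ["test_accuracy", "test_f1", "test_roc_auc"]:
    print(name, scores[name].mean())
```

Because every fold trains a fresh model, the computational time scales roughly with the number of folds, which explains the large time difference between the plain and cross-validated rows in the table above.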

4.5 Cardiovascular: Additional Techniques

4.5.1 Learning Curve

After evaluating the performance of various models for heart disease prediction, it is clear that Gradient Boosting models, especially those with resampling techniques, outperform the others. Next, we examine the learning curves of the best-performing models.

Cardio Figure 3: Learning Curve for Gradient Boosting with All Variables

The learning curve plot, as shown in Cardio Figure 3, reveals that the test score plateaus around the 100,000 to 125,000 samples mark. Therefore, using a training set size within this range would be appropriate and efficient for training this particular Gradient Boosting model, as it achieves a balance between model performance and computational efficiency. Beyond this range, the benefit of additional samples diminishes.

Cardio Figure 4: Learning Curve for Gradient Boosting with Resampling

The learning curve plot, as shown in Cardio Figure 4, reveals that roughly 40,000 samples suffice: beyond this point, the improvement in test score is marginal, indicating that adding more samples is unlikely to significantly improve the model’s performance on unseen data. Therefore, a training set size of around 40,000 samples appears optimal for this Gradient Boosting model with resampling.
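Curves like those in Cardio Figures 3 and 4 can be generated with scikit-learn's `learning_curve` helper. The sketch below uses a small synthetic dataset and a reduced estimator count so it runs quickly; the actual dataset sizes and cross-validation settings of the report are not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the cardiovascular data.
X, y = make_classification(n_samples=1500, random_state=0)

# Fit the model on 5 increasing fractions of the data, 3-fold CV each time.
sizes, train_scores, test_scores = learning_curve(
    GradientBoostingClassifier(n_estimators=50, random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=3, scoring="accuracy")

for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"{n:5d} samples: train={tr:.3f}, test={te:.3f}")
```

Plotting the mean train and test scores against `sizes` (e.g. with matplotlib) reproduces the familiar learning-curve picture: the point where the test curve flattens marks the sample size beyond which extra data yields diminishing returns.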

4.5.2 Checking for Overfitting

We aim to assess overfitting in Gradient Boosting models trained on all variables and resampled data by experimenting with different hyperparameter settings. Specifically, we will vary the maximum depth from 1 to 20 and fix the number of estimators at 50, 100, and 150. By exploring these configurations, we seek to analyze how the model’s performance evolves with varying complexities. This investigation will allow us to determine the optimal trade-off between model complexity and generalization capacity.
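The sweep described above can be sketched as follows. To keep the example fast it uses synthetic data, fixes n_estimators at 50, and samples only a few depths; the full experiment covers max depths 1 to 20 for each of 50, 100, and 150 estimators.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cardiovascular data.
X, y = make_classification(n_samples=800, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for depth in (1, 5, 10, 20):   # the full sweep covers 1-20
    clf = GradientBoostingClassifier(n_estimators=50, max_depth=depth,
                                     random_state=0).fit(X_tr, y_tr)
    results[depth] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))

for depth, (tr, te) in results.items():
    # A widening gap between train and test accuracy at larger depths
    # is the overfitting signature the figures below illustrate.
    print(f"max_depth={depth:2d}: train={tr:.3f}, test={te:.3f}")
```

The same loop, repeated for each n_estimators value and extended to the other metrics, produces the train/test curves shown in the figures that follow.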

4.5.2.1 Checking for Overfitting: Gradient Boosting with All Variables

4.5.2.1.1 Gradient Boosting Classifier with 50 estimators

Cardio Figure 5a: Gradient Boosting Performance with 50 estimators and max depth from 1-20
4.5.2.1.2 Gradient Boosting Classifier with 100 estimators

Cardio Figure 5b: Gradient Boosting Performance with 100 estimators and max depth from 1-20
4.5.2.1.3 Gradient Boosting Classifier with 150 estimators

Cardio Figure 5c: Gradient Boosting Performance with 150 estimators and max depth from 1-20

In our investigation of various combinations of max depth and number of estimators for the Gradient Boosting Classifier, shown in Cardio Figures 5a, 5b, and 5c, we observed a consistent trend: increasing the max depth generally improved performance metrics on the training set, including accuracy, precision, recall, F1-score, and ROC AUC score. However, performance on the testing dataset fluctuated, with certain max depths performing better than others. These observations suggest a potential risk of overfitting.

The optimal max depth tends to fall within the range of 10 to 12, striking a balance between model complexity and generalization ability across different configurations of n_estimators (50, 100, and 150). This range consistently delivers good performance on both the training and testing datasets while mitigating the risk of overfitting.

Interestingly, the choice of n_estimators does not significantly alter the observed trends. Although higher values may offer slightly better performance, the overall behavior of the model, as reflected in the performance curves, remains consistent.

4.5.2.2 Checking for Overfitting of Gradient Boosting with Resampling Data

4.5.2.2.1 Gradient Boosting Classifier (Resampling) with 50 estimators

Cardio Figure 6a: Gradient Boosting (Resampling) Performance with 50 estimators and max depth from 1-20
4.5.2.2.2 Gradient Boosting Classifier (Resampling) with 100 estimators

Cardio Figure 6b: Gradient Boosting (Resampling) Performance with 100 estimators and max depth from 1-20
4.5.2.2.3 Gradient Boosting Classifier (Resampling) with 150 estimators

Cardio Figure 6c: Gradient Boosting (Resampling) Performance with 150 estimators and max depth from 1-20

In our investigation of various combinations of max depth and number of estimators for the Gradient Boosting Classifier with Resampling, shown in Cardio Figures 6a, 6b, and 6c, we observed that increasing the max depth generally improves performance on the training set but may lead to overfitting on the testing set. The optimal max depth appears to be around 12, where a good balance between model complexity and generalization is achieved across different configurations of n_estimators.

5 References

Scikit-learn. (2009). sklearn.ensemble.GradientBoostingClassifier — scikit-learn documentation. Scikit-Learn.org. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html

Alves, L. M. (2021, July 2). KNN (K Nearest Neighbors) and KNeighborsClassifier — What it is, how it works, and a practical… Medium. https://luis-miguel-code.medium.com/knn-k-nearest-neighbors-and-kneighborsclassifier-what-it-is-how-it-works-and-a-practical-914ec089e467

Giola, C., Danti, P., & Magnani, S. (2021, July 13). Learning curves: A novel approach for robustness improvement of load forecasting. MDPI. https://www.mdpi.com/2673-4591/5/1/38#metrics

IBM. (2022). What Is Logistic Regression? IBM. https://www.ibm.com/topics/logistic-regression

IBM. (2023a). What is a Decision Tree? IBM. https://www.ibm.com/topics/decision-trees

IBM. (2023b). What is Random Forest? IBM. https://www.ibm.com/topics/random-forest

Brownlee, J. (2018, November 20). A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning. Machine Learning Mastery. https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/

Nair, R., & Bhagat, A. (2019, April 6). Feature Selection Method To Improve The Accuracy of Classification Algorithm. International Journal of Soft Computing and Engineering. https://www.ijitee.org/wp-content/uploads/papers/v8i6/F3421048619.pdf

Snieder, E., Abogadil, K., & Khan, U. T. (2020). Resampling and ensemble techniques for improving ANN-based high flow forecast accuracy. Department of Civil Engineering, York University. https://hess.copernicus.org/preprints/hess-2020-430/hess-2020-430-manuscript-version4.pdf

Scikit-learn. (2018). sklearn.ensemble.RandomForestClassifier — scikit-learn 0.20.3 documentation. Scikit-Learn.org. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Wizards, D. S. (2023, July 7). Understanding the AdaBoost Algorithm. Medium. https://medium.com/@datasciencewizards/understanding-the-adaboost-algorithm-2e9344d83d9b

Sun, Y., Wong, A. K. C., & Kamel, M. S. (2009). Classification of imbalanced data: A review. International Journal of Pattern Recognition and Artificial Intelligence, 23(4), 687-719. https://www.researchgate.net/publication/263913891_Classification_of_imbalanced_data_a_review

Muralidhar, K. S. V. (2023, July 7). Learning Curve to identify Overfitting and Underfitting in Machine Learning. Medium. https://towardsdatascience.com/learning-curve-to-identify-overfitting-underfitting-problems-133177f38df5#:~:text=Learning%20curve%20of%20an%20overfit%20model%20has%20a%20very%20low

Programmer, P. (2023, May 17). Evaluation Metrics for Classification. Medium. https://medium.com/@impythonprogrammer/evaluation-metrics-for-classification-fc770511052d

What is Overfitting? - Overfitting in Machine Learning Explained - AWS. (n.d.). Amazon Web Services, Inc. Retrieved May 31, 2024, from https://aws.amazon.com/what-is/overfitting/#:~:text=Underfitting%20vs

The links for the three datasets are listed below, as hyperlinks: